
Order keys for s3 #96

Open · wants to merge 2 commits into master
Conversation

btalbot commented May 9, 2017

As reported in #90, there is a bug with key ordering when computing S3 keys. The bug reporter submitted PR #91, which was never accepted; perhaps the use of ES6 features, or the lack of tests demonstrating the bug, was at issue.

This PR updates the existing tests to use a range key so that the issue is obvious, and adds an ES5-compatible fix to the incremental and backfill functions that are affected.

btalbot added 2 commits May 8, 2017 19:36
This exposes a bug (previously filed as mapbox#90) which occurs when items with a range key are read
from the DDB event stream: an md5 hash of the item key is computed and the item is written to S3.

The issue is that the DDB event stream handler does not (and should not) do a 'describe_table' to
learn which key is the HASH and which is the RANGE, and therefore simply generates the md5 hash of
the item keys in whatever order they happen to appear in the stream event.
The s3-backfill util does do a 'describe_table' and orders the keys by declaration
order, which DDB requires to be HASH first, RANGE second.

The different orderings of the item keys produce distinct md5 hash values and different S3
paths/keys, resulting in some items appearing twice in S3, effectively corrupting the incremental
backups since two valid versions will be present at the same time.
…ox#90. see previous commit 0c065a5 for tests which this commit allows to pass
btalbot (Author) commented May 9, 2017

In case it's not clear, this bug causes incremental backups to become corrupted if the order of keys from the event stream ever changes or differs from the declaration order used in s3-backfill.

The incremental backup directory can be left with up to two copies of the same DDB item under different S3 keys. Any snapshots created from the corrupted incremental backup will contain both copies of those items; if used to restore a DDB table, both items for the same keys will be loaded, and the last one loaded will be kept (which is not necessarily the desired most recent version).

This cannot happen with HASH-only keyed items, but any table stream with a RANGE key is susceptible.
